Step 0. Mapping and variant calling
Prior to running Step 1 (Coverage-based analysis) or any of the analyses from Step 2 (Sequence-based analyses) except the reference-free k-mer analysis, you will need to map your reads to a reference genome.
In the SexFindR paper, we used Bowtie2 (v 2.3.4.3; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for read mapping and we provide some scripts and configuration for this within the GitHub repo and below. If you already have mapped reads from bwa-mem or another widely-used algorithm, please feel free to use those.
For all the sequence-based analyses mapped to a reference genome, SNP calling is also required. In the SexFindR paper, we used Platypus (commit 3e72641; https://github.com/andyrimmer/Platypus) to jointly call SNPs across all samples, and we provide some scripts and configurations for this within the GitHub and below. Again, if you have already called SNPs for your samples using GATK or another widely-used algorithm, please feel free to use those.
Platypus requires that the genome file be indexed with samtools (v1.10; https://github.com/samtools/samtools). We also filtered the raw vcf to only include sites that PASS the quality filters. This requires bcftools (v1.9; https://github.com/samtools/bcftools). vcftools is also used to filter for biallelic sites (v0.1.14; https://github.com/vcftools/vcftools).
Download raw reads from NCBI
Requires fasterq-dump from the SRA-tools (https://github.com/ncbi/sra-tools). In my experience, fasterq-dump can be fairly flakey depending on your system and connection, so this download might need to be repeated multiple times to get all the samples. You might want to separate out the acc_list.txt files or add something like if [ ! -f ${file}_1.fastq ] to the code below to avoid downloading the same file repeatedly.
for file in $(cat acc_list.txt); do echo $file; date; fasterq-dump $file -e 36; done
Download the genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/901/000/725/GCF_901000725.2_fTakRub1.2/GCF_901000725.2_fTakRub1.2_genomic.fna.gz
gunzip GCF_901000725.2_fTakRub1.2_genomic.fna.gz
Create a bowtie2 index
bash bowtie2_makeindex_linux.sh GCF_901000725.2_fTakRub1.2_genomic.fna fugu &> index_outerr.txt &
Map reads to the genome
For each sample, the reads must be mapped using a command like:
bash bowtie2_16_linux.sh SRR8585991_* fugu &> bt2_SRR8585991_outerr.txt &
bowtie2_16_long.sh is also included as a reference on SLURM systems.
Call variants using Platypus
Variants are jointly called (all at once) through the use of a bam list (e.g., 15M_14F_bams.txt). The following was run on a SLURM system:
samtools faidx GCF_901000725.2_fTakRub1.2_genomic.fna
sbatch platypus_all_region_1day.sh 15M_14F_bams.txt GCF_901000725.2_fTakRub1.2_genomic.fna
Filter for variant calls that PASS quality filters
This script will keep only those variants that have PASS in the FILTER field, removing low quality calls.
sbatch bcf_filter.sh fugu_14M_13F.vcf
Filter for biallelic sites
Some downstream analyses (e.g., Fst) require only biallelic sites to work properly (e.g., 0/0, 0/1, 1/1). Execute the following to filter for only biallelic sites:
vcftools --vcf filtered_PASS_fugu_14M_13F.vcf --max-alleles 2 --stdout --recode --recode-INFO-all | gzip -c > biallelic_filtered_PASS_fugu_14M_13F.vcf.gz